Introduction

How does music relate to the lyrics? It is tempting to think that a song tries to convey some feeling or emotion, and that both the music and the lyrics are there to support this message. Let me give you an example. We might expect a song with a slow beat and laid-back guitar to talk about laid-back topics, maybe a trip to the beach. At the other end of the spectrum, heavy metal would likely concern itself with darker, heavier subjects. But are these suspicions even true? Let’s put some numbers to the hypothesis that there is in fact a relationship between music and lyrics. In the next sections I’ll take you on a journey where we approach this topic with a statistical mindset, harnessing all the powers that modern technology has to offer along the way.

We’ll start out by picking a large body of music and, for each track in it, collecting and storing the lyrics. It would be way too cumbersome to scrape all the lyrics from the internet myself, but fortunately the Musixmatch API allows querying the lyrics for a track in a single API call. For an unpaid account only 30% of the lyrics for a queried track are returned, but that will do for our intents and purposes. I assign every set of lyrics a sentiment, or valence, score automatically using the NLTK package, which offers natural language processing functionality. A low score indicates a sad feeling, whereas a high score indicates a happy one. Once the lyrics have a numerical score we can start to answer our question: how does music relate to lyrics?

The research question is still a bit broad. We settled on how to analyze the lyrics, but not yet on which aspects of the music we’ll focus on. We are going to keep the research broad, and explore how the lyrics relate to the four main elements of music: melody, harmony, instrumentation and rhythm. To access and preprocess the musical properties, the Spotify API is used. For each element we will either confirm or refute hypotheses that intuitively make a lot of sense, but are not (yet) backed up by data.

Let us dive into it!

Corpus

The first order of business is choosing the corpus of music. We have chosen a broad research question, and the corpus should reflect this. It must draw inspiration from various genres and contain a large number of songs; only then can we justify general conclusions. Because the heavy lifting of fetching the data we need is done by the Musixmatch and Spotify APIs, this is most certainly possible.

The exhaustive list of albums that are included in this research:

Elephant, Madvillainy, …Like Clockwork, Street Worms, Midnights, HEROES & VILLAINS, St. Elsewhere, The White Album, Plastic Beach, Demon Days, Thriller, In the Aeroplane Over the Sea, Hawaii: Part II, WHEN WE ALL FALL ASLEEP, WHERE DO WE GO?, Dua Lipa, The Money Store, OFFLINE!, OK Computer and Rumours

This totals over 280 tracks and 16 hours of listening time.


Playlist

Discovery


The Spotify API offers a plethora of functionalities that range from very high to very low level. Here we will use some of the high-level analyses, like valence and energy, to learn about the corpus.

When we plot the energy, musical valence and lyrical valence values against each other, we find something enormously interesting: while energy and valence do not seem related, musical and lyrical valence appear highly correlated.
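A quick way to quantify what such a plot shows is a correlation matrix over the per-track features. The data below is synthetic (the column names mirror the ones used here, but the values are made up so the example is self-contained):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100

# Synthetic stand-ins: musical and lyrical valence constructed to co-vary,
# energy drawn independently of both
musical_valence = rng.random(n)
df = pd.DataFrame({
    "energy": rng.random(n),
    "musical_valence": musical_valence,
    "lyrical_valence": 1.6 * musical_valence - 0.8 + rng.normal(0.0, 0.1, n),
})

corr = df.corr()
print(corr.round(2))
```

On the real corpus the same `df.corr()` call backs up the eyeballed finding with an actual number per feature pair.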

Melody


Research: can we identify melodies from audio data?

Intuitively, it seems that melody encodes a lot of the valence information of a song. The melody is usually the most memorable part and often indicative of the feel of a song. So it makes sense to look at the melody of two tracks, one with low and one with high lyrical valence, and investigate how melody correlates with lyrical valence. A sensible visualization tool to use is a chromagram. It captures, for each moment in time, which notes are played, as analyzed using the Fourier transform. Let’s try this and see if any melody lines become apparent.
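The idea behind a chromagram can be sketched directly with NumPy: take the FFT of a short audio frame, convert each bin frequency to a pitch class, and sum the magnitudes per class. (Libraries like librosa do this properly with `chroma_stft`; the version below is a bare-bones illustration.)

```python
import numpy as np

def chroma_frame(frame, sr):
    """Fold the FFT magnitudes of one audio frame onto the 12 pitch classes (C=0 ... B=11)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    chroma = np.zeros(12)
    for f, mag in zip(freqs, spectrum):
        if f < 27.5:  # skip DC and sub-audible rumble (below A0)
            continue
        midi = 69 + 12 * np.log2(f / 440.0)  # frequency -> MIDI pitch number
        chroma[int(np.round(midi)) % 12] += mag  # MIDI number mod 12 -> pitch class
    return chroma

# Sanity check: a pure 329.63 Hz tone (an 'E') should dominate pitch class 4
sr = 22050
t = np.arange(4096) / sr
tone = np.sin(2 * np.pi * 329.63 * t)
print(np.argmax(chroma_frame(tone, sr)))  # → 4, i.e. 'E'
```

Stacking such frames over time gives the time-by-pitch-class image we’ll be looking at.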

Unfortunately, looking at the chromagrams, no discernible melody is recognizable. The only thing that sticks out is the droning ‘E’ in Ball and Biscuit, but this could hardly be called a melody. It appears we need a different tool.

TODO: explain what melody the human ear hears, and that it’s not detected

Melody/Harmony


Research: what is the saddest key?

Apparently it’s difficult to find melodies when faced with a chromagram. Instead of identifying specific melody lines, we could focus on the key in which the melody is played. Luckily Spotify gives us the key and mode of every track in our corpus, so we don’t have to compute this ourselves. When we plot the lyrical valence for each key, where the band around the bars represents the number of tracks in that specific key, we get the bar plot to the side. What meets the eye is a huge spike at the D sharp (or E flat) key. What could this mean? Unfortunately not a lot, because upon closer inspection it appears that songs in that key are heavily underrepresented in the corpus. There does seem to be quite a bit of variation among the other keys, though, especially the keys in B, which appear to affect the lyrics in a negative way. This suggests that there is in fact such a thing as “the saddest key” (which would be B major).

That said, this could be coincidental, and the effect might cancel out if the corpus were much larger. Something that does favor caution is the average lyrical valence per mode, which converges to ~0.13 for both: there is no significant distinction between the average major and minor mode, even though we’re always told that minor keys are “sad” and major keys are “happy”. These results deny what our music teachers have been telling us for centuries!
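The aggregation behind that bar plot is a plain groupby over key; carrying the track count along makes the underrepresentation caveat visible. The values here are invented for illustration:

```python
import pandas as pd

# Hypothetical per-track table: Spotify's key as a pitch-class name,
# lyrical valence from the sentiment-scoring step (values invented)
df = pd.DataFrame({
    "key": ["B", "B", "B", "D#", "C", "C", "G"],
    "lyrical_valence": [-0.4, -0.2, -0.3, 0.9, 0.3, 0.1, 0.2],
})

per_key = df.groupby("key")["lyrical_valence"].agg(["mean", "count"])
print(per_key)
```

Note how the D# “spike” in this toy table rests on a single track, exactly the kind of artifact a mean-only plot would hide.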

Harmony detection limitations


I should address some issues with automatic key matching that explain why we should take the idea that B major is the saddest key with an even larger grain of salt. Key matching might work for a lot of tracks, but there are many cases where it fails too. Where key matching fails most spectacularly is for highly percussive tracks, which is usually the case for hip hop. This is due to the inharmonic nature of most percussive instruments. Take for instance UNTITLED by JPEGMAFIA. Upon listening, the energetic hi-hats and fast drum kicks stand out. This is manifested in the corresponding keygram, where for each section every key on the y-axis is matched; the brighter the tile, the more strongly the key matched. We’d expect a straight line that changes height only after a modulation, but in the UNTITLED track there is no such pattern to be found.

Another issue is brought forward by a limitation of the Spotify API: it can only distinguish two modes (major and minor). This is problematic, because many artists use other modes to achieve a variety of effects that cannot be achieved with just minor or major keys. The song Electioneering by Radiohead is in D Dorian, which is the minor scale with a raised sixth. The result is that the key lies somewhere in between D minor and D major, which is reflected in the keygram.

Though we could match the keys ourselves for every track in the corpus, another issue would present itself. To match every possible mode, the search space would become too large and our results too cluttered to identify a specific key, as multiple keys would always match somewhat.
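For reference, the classic way to match keys yourself is template correlation: rotate a major and a minor key profile through all 12 tonics and keep the best-correlating one. A minimal sketch using the Krumhansl-Kessler profiles (every extra mode template you add multiplies the search space, which is the clutter problem described above):

```python
import numpy as np

# Krumhansl-Kessler key profiles (pitch classes relative to the tonic)
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def match_key(chroma):
    """Return the (tonic, mode) whose rotated profile correlates best with a chroma vector."""
    best, best_r = None, -2.0
    for mode, profile in (("major", MAJOR), ("minor", MINOR)):
        for k in range(12):  # rotate the profile to each candidate tonic
            r = np.corrcoef(chroma, np.roll(profile, k))[0, 1]
            if r > best_r:
                best, best_r = (NOTES[k], mode), r
    return best

# Sanity check: a chroma vector shaped exactly like the D-major profile matches D major
print(match_key(np.roll(MAJOR, 2)))  # → ('D', 'major')
```

Per-section keygrams like the ones above are essentially the full matrix of these correlation scores over time.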

Instrumentation

Rhythm


Hypothesis: higher tempo songs tend to be more aggressive and slower songs more sensual.

Let’s put this one to the test. For this hypothesis we’ll label songs with a BPM below the median (< 115.9 BPM) as slow songs, and the remainder (≥ 115.9 BPM) as fast songs.

So far we’ve explored only the lyrical valence property, but not the lyrics themselves. We might gain some new insights if we look at the lyrics directly, so let’s try it. One of the most useful tools for visualizing patterns in textual data is a so-called word cloud, which you can see to the side. The words in blue occur much more frequently in fast songs than in slow songs, and vice versa for the red words.
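Under the hood, a word cloud like this boils down to comparing relative word frequencies between the two tempo groups. A toy version (the lyric fragments are invented; add-one smoothing keeps words unique to one group from dividing by zero):

```python
from collections import Counter

# Invented lyric fragments standing in for the slow/fast halves of the corpus
slow_words = "kiss me boy call my number kiss kiss hot".split()
fast_words = "run run gun kill run gun go go go".split()

slow, fast = Counter(slow_words), Counter(fast_words)

# Smoothed fast/slow frequency ratio: > 1 leans 'fast' (blue), < 1 leans 'slow' (red)
ratio = {
    w: ((fast[w] + 1) / (len(fast_words) + 1)) / ((slow[w] + 1) / (len(slow_words) + 1))
    for w in set(slow) | set(fast)
}
```

Feeding these ratios to a word-cloud renderer as sizes and colors reproduces the figure.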

Immediately we can see instances that support the hypothesis. Slow-song words include sensual words such as number (as in, someone’s phone number), kiss, boy and hot. These are words we would expect to encounter in a love song. What stands out, though, is that love is included in the fast songs. There are also some odd ones out, like bones. As for the fast tracks, we also find what one would expect, e.g. aggressive words like kill, gun and ill. Also, very noticeably, we find numerous verbs and filler words. This makes sense: in a high-BPM track the singer (or rapper) has to keep up the pace, and it’s easiest for both the artist and the listener to reuse common verbs and filler words to keep the information stream somewhat limited.

Most of the data in this plot seems to confirm the hypothesis (though there are exceptions, like love among the fast tracks).

AI: hyperparameter tuning


Research: are there some hidden relationships that have yet to be found?

So far we have plotted numerous relationships between different variables and debunked or confirmed a number of hypotheses. Though the reason I picked those visualizations is that there seemed to be potential for an interesting correlation, there is still a chance that some totally unexpected patterns exist. Because these may escape a mere human like me, maybe the right machine learning tool can pick them up. So that’s what we’ll be trying.

The technique we’ll use is one of the most successful machine learning algorithms of the modern day, called extreme gradient boosting (XGBoost). In essence, it’s an ensemble of many small decision trees that boost each other to achieve superior results. To uncover hidden relationships we will train the model, using regression, to predict the lyrical valence of a song based on a whole array of inputs that the Spotify API delivers (like mode, tempo, musical valence, etc.). Before we dive in and use it, some preparations need to be taken care of. First, the corpus is split into a train set (to train the model) and a test set (to evaluate the model). Next up, XGBoost requires us to set a number of hyperparameters. We tune these by training the model for many different combinations of hyperparameter values and evaluating each using cross validation. In the plot you can see how well each parameter value works: the lower the RMSE, the better. Now that we know the optimal combination, we can finally properly train and evaluate the model.
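The tuning loop looks roughly like this. To keep the sketch dependency-light it uses scikit-learn’s GradientBoostingRegressor, which implements the same boosted-trees idea; in the actual analysis you would swap in `xgboost.XGBRegressor` with the same grid-search scaffolding. The features and targets here are synthetic stand-ins:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the Spotify inputs (mode, tempo, valence, energy, ...)
X = rng.random((300, 4))
y = 0.6 * X[:, 0] - 0.3 * X[:, 2] + rng.normal(0.0, 0.05, 300)  # 'lyrical valence'

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validated search over a small hyperparameter grid, scored by RMSE
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={
        "n_estimators": [100, 200],
        "max_depth": [2, 3],
        "learning_rate": [0.05, 0.1],
    },
    scoring="neg_root_mean_squared_error",
    cv=3,
)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)
```

The per-parameter RMSE curves in the plot come straight out of `grid.cv_results_`.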

AI: results


After training the model, we end up with an RMSE (root mean square error) of 0.7488581. This means that the model is, on average, that far off from the correct answer. As a reminder: the lyrical valence ranges from -1 to 1. It could well be that this still sounds quite abstract. As a comparison I also evaluated a model that always makes a random guess. That model achieves an RMSE of 0.879256, which is significantly worse. Therefore, the XGBoost model must have found some pattern, which makes it worth looking at whatever it found. One valuable piece of information that we can extract from the XGBoost model is the set of feature importances it learned. They tell us how important the model deems each input parameter for predicting the lyrical valence, which you can see in the plot.
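Extracting the importances is a one-liner once the model is fit; `xgboost.XGBRegressor` exposes the same `feature_importances_` attribute, so the sketch below again uses scikit-learn’s gradient booster with synthetic data (only the first invented feature actually drives the target):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

# Synthetic stand-ins: only 'energy' actually drives the invented target
feature_names = ["energy", "musical_valence", "loudness", "danceability"]
X = rng.random((300, 4))
y = 0.8 * X[:, 0] + rng.normal(0.0, 0.05, 300)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances sum to 1; a higher value means the feature drove more splits
for name, imp in zip(feature_names, model.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

Sorting these values is exactly what produces the importance plot discussed next.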

Many of these findings we had already found ourselves. We already discovered that the mode of the key does not matter, and that musical valence and tempo correlate with lyrical valence. More interesting is what we did not find. Apparently, according to the model, energy is the most predictive factor for the lyrical valence, even more so than musical valence. This makes sense intuitively: high-energy songs might be more likely to have more energetic lyrics. The model also judges loudness and danceability to be somewhat important features. A reason could be that the genre of a track determines the range in which those features, including energy, fall, and that the genre also determines what the lyrics are generally about.

Conclusion

To be concluded.